Stagger: A modern POS tagger for Swedish

Author

  • Robert Östling
Abstract

The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version of the Stockholm-Umeå Corpus is presented, whose more consistent annotation leads to significantly lower error rates for the POS tagger. Finally, a new, freely available annotated corpus of Swedish blog posts is presented and used to evaluate the tagger’s accuracy on this increasingly important genre. Details of the evaluation are presented throughout, to ensure easy comparison with future results.

Similar resources

Tagging the Past: Experiments using the Saga Corpus

There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...

A Neural Model for Part-of-Speech Tagging in Historical Texts

Historical texts are challenging for natural language processing because they differ linguistically from modern texts and because of their lack of orthographical and grammatical standardisation. We use a character-level neural network to build a part-of-speech (POS) tagger that can process historical data directly without requiring a separate spelling normalisation stage. Its performance in a S...

Analysing Inconsistencies and Errors in PoS Tagging in two Icelandic Gold Standards

This paper describes work in progress. We experiment with training a state-of-the-art tagger, Stagger, on a new gold standard, MIM-GOLD, for the PoS tagging of Icelandic. We compare the results to results obtained using a previous gold standard, IFD. Using MIM-GOLD, tagging accuracy is considerably lower, 92.76% compared to 93.67% accuracy for IFD. We analyze and classify the errors made by the...

Some applications of a statistical tagger for Swedish

We will briefly describe a part-of-speech (POS) tagger for Swedish and discuss some applications: rule-based and probabilistic grammar checking, word prediction and keyword extraction. In POS tagging of a text, each word and punctuation mark in the text is assigned a morphosyntactic tag. We have designed and implemented a tagger based on a second order Hidden Markov Model [1]. Given a sequence o...

Big is beautiful: Bootstrapping a PoS tagger for Swedish

A statistical part-of-speech tagger trained on a one-million word Swedish corpus with validated tags was used to tag two considerably larger untagged corpora (≈ 78 and 20 million words, respectively) to bootstrap new, improved, tagger models. The new taggers all showed better accuracy both for seen and unseen words, and the best tagger had 97.02% overall accuracy evaluated on the original corpu...

Publication date: 2012